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Abstract 

In this paper we consider the problem of segmenting n aligned random se¬ 
quences of equal length m into a finite number of independent blocks. We propose 
to use a penalized maximum likelihood criterion to infer simultaneously the num¬ 
ber of points of independence as well as the position of each one of these points. 
We show how to compute the estimator efficiently by means of a dynamic pro¬ 
gramming algorithm with time complexity 0{m?n). We also propose another 
algorithm, called hierarchical algorithm^ that provides an approximation to the 
estimator when the sample size increases and runs in time 0{mn). Our main 
theoretical result is the proof of almost sure consistency of the estimator and the 
convergence of the hierarchical algorithm when the sample size n grows to infin¬ 
ity. We illustrate the convergence of these algorithms through some simulation 
examples and we apply the method to real protein sequence alignments of Ebola 
Virus. 


1 Introduction 

The problem of multiple sequence segmentation and dimensionality reduction appears 
in many application areas, including the analysis of multiple alignments of DNA/RNA 
and Amino Acid (AA) sequences. These datasets are characterized by the alignment of 
sequences belonging to the same group, as genes, protein families, etc. In general the 
length of these alignments is big and direct multivariate analysis is not possible. On the 
other hand, the identihcation of relevant patterns in these datasets is crucial for many 
biological problems, as for example the identihcation of specihc patterns associated to 
some phenotypic variables. An incomplete list of methods for sequence segmentation 
and pattern identihcation of biological sequences include [SI El 12] • 

In this paper we consider the problem of segmenting n aligned random sequences 
of equal length m into a hnite number of independent blocks. To be more precise, let 
X = (Ai,..., Aat) be a random vector with probability distribution P taking values in 
Ai X ... X Atv, where A* is a hnite alphabet for alH = 1,..., A. We say j = i + with 
i G {1, 2,..., A—1}, is a point of independence for X if the random vectors (Ai,..., Xj) 
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and (Xj+i,... ,Xn) are independent. We are interested in inferring the maximal set 
of points of independence of a random vector, when the nnmber of snch points is 
nnknown. This is a model selection problem becanse the different parameter sets of 
the probability distribntions belong to vector spaces with different dimensions. We 
propose a penalized maximnm likelihood criterion to infer simnltaneonsly the nnmber 
of points of independence and their positions. Onr main theoretical resnlt is the proof 
of the almost snre convergence of the estimator to the trne set of points of independence 
when the sample size n increases. 

From the compntational point of view we show how to compnte the estimator by 
means of a dynamic programming algorithm that rnns in time 0{m?n). When the size 
of the seqnences is large this can be time consnming. In these sitnations we propose a 
snboptimal algorithm rnnning in time 0{mn) that also converges almost snrely to the 
set of points of independence when the sample size n increases. 

To onr knowledge this is the hrst model selection approach for mnltiple seqnence 
segmentation based on the assnmption of independence between segments. The change 
point detection problem is a well stndied phenomenon in the area of time series analysis 
(see for example [7] and references therein) bnt in general the setting and the goals are 
different from onrs. 

In the next section we present the main dehnitions and state the main theoretical 
resnlt. In Section [3] we describe the algorithm to compnte exactly the estimator and we 
present the hierarchical algorithm that provides an efficient bnt snboptimal solntion. 
In Section 0] we prove the theoretical resnits and in Sections [5] and [6] we present the 
simnlations and resnits on real data, respectively. 


2 Likelihood function and model selection 


Given two integers r < s denote by r: s the integer interval r, r + 1,..., s. Denote by 
Jr,s the set {i + \\ i = r, 1,..., s — 1}. We say Ur-.s C Jr-.s is a maximal set of points 
of independence for the interval r: s if no n G Jr-.s \ Ur-.s is a point of independence for 
X. For each random vector X and each interval r\s there is only one maximal set of 
points of independence; from now on this special set will be denoted by U*.g. In the 
special case r = 1, s = At we will simply write U, J and U*. 

Withont loss of generality we will also snppose that the set Ur.s C Jr.s is ordered; in 
this case Ur.s = {ui,, Uk) with Ui < Uj if i < j. From Ur.s it is possible to obtain the 
set of blocks of independent variables as the set B(Ur-.s) = {h, h, ■ ■ ■, h+i} of integer 
intervals given by 


Ji = r : (mi - i), 


h = 


Ui-l + 


[Ui - 


* = 2 , 


and 


Ik+i — {uk + |) : s. 


Given an integer interval I = r:s denote by the set of hnite seqnences ArX---xAs 
and by 1^4^| = ni=r with \Ai\ the cardinal of the set Ai. 
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In this way, given a vector of observations x = {xi,... ,xn), the likelihood function 
for the set U can be written as 

L(t/;x) = J] FiXj=xf.jeI). 

I£B(U) 

Assume we observe an i.i.d sample x*^^\ ..., x^”) of X, denoted by X. Then we have 

n 

HU; X) = n n =T ■ j -f) ■ 

i=l l£B{U) 

Denote by Xr-,s the sequence Xr,... ,Xs and by the i.i.d sample Xr!i,..., x^^j. 
Given a hnite sequence ar-.s G dehne 

n 

N{ar:s) = ^ 1 { 4:1 = ■ 

i=l 

Then X) can be rewritten as 

L{U-X) = J] J] P(X, = . (2.1) 

I&B{U) aieAi 

To proceed we need a model for the block I and some estimators for the probabilities 
P(X/ = a/). Assume that for each block I = r:s <Zl:N we have a model assignment 
M.{I) and a penalization function p(X 7 -) satisfying 

(Al) For any z G r: (s — 1) we have 

P(Xj.;^) > p(Xj.,j) T p(Xp_|_l),5) . 

We will assume that in model Xi{I) we have maximum likelihood estimators for the 
probabilities P(ar); that is some values P(a/) maximizing fl2.ip . Then we can plug-in 
these estimators in fl2.ip obtaining 

L{U-X) = J] J] . 

IeB{U) ai&Ai 

Remark 2.1. As an example we could consider for any block I the multivariate distri¬ 
bution over as possible model assignment A4{I). For this class of models we have 
that the maximum likelihood estimators are given by 

P(a,) = aieA^. 

n 

In this case, for example, we could take as penalizing function p(X/) = c(|A'^| — 1), the 
number of parameters in the multivariate distribution over A^ (in case A^ is known). 


3 






multiplied by a positive constant c. In the case is unknown a priori, it could be 
appropriate to use an empirical version of p(/); namely p(X/) = l|Xj||o — 1), 

where ||Xj||o = Ylai&Ai > 0} c is a positive constant. In some cases other 

model assignments or penalizing functions could be more efficient or parsimonious, 
mainly in the case of large blocks I. One example is to consider a Markov chain of 
some appropriate order. 

In the sequel it will be easier to work with the logarithm of the estimated likelihood 
function, for this reason we dehne 

l£B(U) 

where 

Q(/,X)= iV(a,) logP(a,). 

aj&A^ 

Now we introduce the model selection criterion based on the maximization of the 
penalized log-likelihood. 

Definition 2.2. Given a sample X dehne 

G(X) = argmax { £(t/,X)-pen(G,n) }, (2.2) 

f/C J,f/ordered 

where 

pen(t/, n) = E p(X/)log(u). 

The hrst theoretical result of this article shows that when n grows to inhnity, the 
estimator (12.2p is equal to the true set U* eventually almost surely. In other words, 
with probability one there exists no (depending on the entire sample ..., such 
that for all n > hq we have U(X.) = U*. 

Theorem 2.3. The estimator (12.2p is strongly consistent; that is 

f/(X) = U* 

eventually almost surely when n —)■ cxd. 

The proof of Theorem 12.31 is postponed to Section 01 

3 Computation of the independence set estimator 

In this section we show how to compute efficiently the penalized maximum likelihood 
estimator given in Dehnition 12.21 The hrst subsection, mostly inspired in [0], presents 
a dynamic programming algorithm that computes exactly the optimal argument of 
(12.2p . at the cost of a quadratic computing time in terms of the length of the sequences 
(see also P and references therein). In the second subsection we propose a divide and 
conquer approximation for the optimum in fl2.2p . at a more efficient computing time. 
We show that this second algorithm also retrieves the true set of points of independence 
with probability one when the sample size grows. 
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3.1 Dynamic programming algorithm 

Let Fk+i{N) denote the maximum value of the function in fl2.2p corresponding to a 
fc-dimensional vector U for the sample X; that is 

Fk+i{N) = rna,x { £{11, X) — pen(17, n) } . 

It is easy to see that the optimal /c-dimensional vector U leading to Fk+i{N) consist 
of fc — 1 independence points over 1: i and a single block (i + 1): N, where i + 1/2 is 
the rightmost point of independence. Moreover, the k blocks over 1: i must maximize 
the function fl2.2p for the sample Xi-j, attaining Fk{i). In this way, the dynamic 
programming recursion is 

Fi(iV) = Q(l:iV,X), Fu+i{N)= max { + Q((z + 1): iV, X) } , 


where Q is given by 

Q(J,X) = g(J,X)-p(X,) log(n). 

The estimator tJ(X.) in Dehnition 12.21 is computed by tabulating Fi{i) for all i up 
to N, and then by computing F 2 {i) for all i and so on up to Fn{N). The optimal value 
of k is obtained by the equation 

k = argmax{ Fk{N) } — 1. 

k=l,...,N 

and the vector 17(X) = (hi,..., h^) is given by 

itj^ = arg max { Fj^{i) + Q{{i + 1): iV, X) } + ^ , 

iti = argmax{ Fi{£) + Q{{£ + 1) :X, X) } + ^ , i = 1,... ,k - 1. 

3.2 Hierarchical Algorithm 

Here we present an efficient divide and conquer algorithm to compute the estimator 
given by Dehnition 12.21 with computational cost 0{nm). 

Let I = r: s be an integer interval (with the convention that / = 0 if s < r). Dehne 

h{I, X) = argmax{ Q{r:{i - 1),X) + Q{i:s, X) } - r - , (3.1) 

iei 2 

where also by convention we have g(0, X) = 0. 

The idea of this algorithm is to compute the best point of independence for the 
interval I (if there is one point in the interval leading to a maximum of the penalized 
likelihood or a point outside I otherwise) and then to iterate this criterion on both 
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H-algorithm 

Uo = H{r:s,X.) 

Initialization: 
t/o = 0 and / = r:s. 

Recursion: 

1. k = /i(/,X). 

2. If fc > 0 do 

compute Ui = H{r : (r + /c — i), X). 
compute U 2 = H{{r + fc + i): s, X). 
return Uq = {k + r] VlUiU U 2 . 
else break; 


Figure 1: if-algorithm. 

segments separated by this point, until no more points are detected. A summary of 
the Hierarchical algorithm is given in Figured] 

We show the almost sure convergence of the hierarchical algorithm in the following 
theorem. 

Theorem 3.1. ii(l:iV, X) = U* eventually almost surely as n ^ 00 . 

The proof of this theorem is given in the following section. 


4 Proofs 


Let / C 1: iV be an integer interval. Given two probability distributions Pi and P 2 
over let P(Pi||P 2 ) denote the Kullhack-Leihler divergence between Pi and P 2 ; that 

is 


^(^111^2) 


A (a/) log ( 

aj\P\{ai)>0 


-Pi(Q/) ^ 
P2(a/) ) ■ 


A well-know property about the Kullback-Leibler divergence between two probability 
distributions states that P = 0 if and only if Pi = P 2 . 


Lemma 4.1. For any block / C 1: A^ and any a > 0 we have that 

a log(n 
n 

eventually almost surely as n ^ 00 , where P andP represent the marginal distributions 
over the interval I. 


P(P||P) 
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Proof. Define, for a fixed aj G with P(a 7 -) > 0, the random variables 


Yi{ai) = = ai} - P(a/), i = 1, 2,..., n 


and 

n 

Zn{ai) = '^Yi{ai) = N{ai) -nP(a/). 

i=l 

The variables {Yi{ai) : i = are independent and identically distributed, 

with E(Y'i(a/)) = 0 and E.fYfai)'^) = P(a/)(1 — P(a/)). Then, by the Law of the 
Iterated Logarithm (cf. Theorem 3.52 in |1]) we have that for any e > 0 

\Zn{ai)\ < {1 +e)¥{aj){l-F{ai)) ^/2n\og\o^ (4.1) 


eventually almost surely as n —)■ oo. Dividing both sides of the last inequality by 
?7,-^P(a/) we obtain that for any 5 > 0 


\Zniai)/n\ |P(a/) - P(a/)| 


< 


(51og(?7,) 


n 


eventually almost surely as n ^ oo. Now by Lemma 6.3 in and 

|P(a,)-P(a,)|2 


c(piip) < Y1 

ai:N{aj)>0 


< \A^ 

6 log(?7,) 

\A^n 


max 

a/:P(a7)>0 


P(a/) 

|P(a,)-P(a,)p 

P(a/) 


< 


(4.2) 
we have that 


eventually almost surely as n ^ oo and this implies the result for 1^4'^ | hnite. □ 

For any j ^ I = r:s denote by Fj the probability distribution given by 

Fj{ai) = P(a,.,j)P(a(j+i):J, aj G A^. 


Lemma 4.2. Let I = r:s G l:N and suppose I C I*, with I* G B{U*); that is, there 
is no point of independence in the rage of I. Then there exists a constant 5 > 0 such 
that for any i ^ I,i ^ s we have 


Q{L,X) - Q(r:qX) - g((* + l):s;X) > 6n 


eventually almost surely when n —)■ oo. 
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Proof. Note that for any a* G we have iV(a*) = N^afalj^f) and analogonsly 

for any G iV(a|+i) = E 4 ^(a>?+i)- Then 


Q(/;X) - Q(r:z;X) 


Q((^ + l):s;X) 


Y1 ^(«r)log 


P(a^) 

P(4)P(a|+i) 


Dividing by n and taking limit when n ^ 00 we have that the expression above 
converges almost surely to 


T(4)log 


P(4) 

P(4)P(a|+i) 


D(P||P,) > 0. 


The last inequality follows because i + | is not a point of independence for X. To 
conclude the proof, it is enough to take 

5= min D(P||Pj)/2. □ 

Lemma 4.3. Let I = r\s dl'.N and suppose there exists i E I such that i + ^ E U*; 
that is, there is a point of independence in the rage of I. Then for any a > 0 we have 

Q(/;X) - Q{r:i;X) - Q{{i + 1): s;X} < «log(n) 


eventually almost surely when n ^ 00 . 

Proof. As in the proof of Lemma 14.21 we can write 

Q(J;X) - Q(r:qX) - Q((* + 1): s; X) = iV(4) log 

a«GA^ 

As P(-) is the maximum likelihood estimator of P(-) and i +1 is a point of independence 
we have that 

Q(r:qX) + Q((* + l):s;X) > iV(a^) log[P(4)P(4+J] 

= ^ /V«)log[P«)] 

Then by combining this last inequality and Lemma 14.11 we have that 


P(4) 

P(a*)P(a|+J. ■ 


Q(J;X) - g(r:*;X) 


Q((^ + 1 ): 5 ;X) < iV(4) log 

< a log(n) 


P(a: 


.T(4). 


eventually almost surely as n —)■ 00 . 


□ 
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Figure 2: Case (a): U <ZU*. 


Proof of Theorem IS.M We will show that if U C J (ordered) is such that U ^ U* then 
there exists a modihcation U' of U such that 

£(f/';X)-pen(t/',n) > £(f/; X) - pen(t/, n) (4.3) 

eventually almost surely as n oo. As the number of possible such modihcations of 
U leading to U* is hnite this implies the result. Consider the following cases 

(a) U CU* 

(b) U (f U*. 

First we will prove fl4.3p for case (a). Note that in this case it is enough to show it in 
the case U* ^ 0. Assume U* = (u*,..., u^), with K >1^U = (ui,..., Uk) and let i be 
the hrst index such that u* ^ U. Dehne U' = U U {u* } = (ui,..., Uj-i, u*, Uj+i,..., Uk) 
(see Figure E]). 

Note that B{U') has the two blocks A = (nj_i + i): {u* — \) and I 2 = («*+!): (ui —|) 
replacing the single block I = (wj-i + |): {ui — |) in BfU). Then by Lemma we 
have that for any a > 0 

£([/'; X)-£(C;X) = g(/i,X) + Q(/ 2 ,X)-Q(/,X) 

> — a log(?7,) 

eventually almost surely as n —)■ cxd. On the other hand 

pen(f/', n) — pen(f/, n) = ci log(n), 

where Ci = p(X/J +p(X/ 2 ) — p(X/) < 0. Therefore eventually almost surely as n —)■ cxo 
we have that 

> pen(t/',n)-pen(f/,n). 


9 









u* 


u 






U^-l 


Ui Ui^i 


Ur 


U -+ 


H-h 


Ui-l 

U' -h- 




Ur 


Figure 3: Case (b): U (^U*. 


Finally we will consider case (b). As before suppose U* = U = 

(mi, ... ,Mfc) and let i be the first index such that Ui ^ U*. Dehne U' = U \ {ui} = 
(mi, ..., Mj-i, Uj+i,..., Uk) (see FigurefS]). In this case we have that B{U') has the single 
block I = {ui-i + i): (uj+i — |) replacing the two blocks Ji = (uj-i + |): (n* — |) and 
h = {ui + |): {ui+i - i) in B{U). Then 

£(f/'; X) - i{U- X) = Q{I, X) - Qih, X) - Q(J2, X). 

By Lemma 14.21 as Ui ^ U* we have that there exists a constant h > 0 such that 
eventually almost surely 

i{U’;X.) - i{U;X.) >6n. 

On the other hand we have that 

pen(17', n) — pen(t/, n) = C 2 log(n), 

where C 2 = p(X/) —p(X/J —p(X/ 2 ) > 0. Therefore eventually almost surely as n ^ cx) 
we have that 

£(f/'; X) - £(f/; X) > pen(f/',u)-pen(f/,n). 

This concludes the proof of Theorem 12.31 □ 

Proof of Theorem \‘3 . 1[ First consider the case U* = 0. By Lemma 14.21 we have that 
eventually almost surely as n —)■ oo the value of i maximizing (13.Ih will be i = r, giving 
h{l : X, X) = —i. Therefore iL(l: X, X) = 0 = t/* eventually almost surely as n ^ oo. 
Now suppose there is a point of independence u & U*. We will prove that eventually 
almost surely u G iL(l: X, X). As u G t/*, for any integer r < u and any integer s > u 
we have by Lemma [4.31 that for any a > 0 

Q(r :(u-^),X) + Q((u + ^): s), X) - Q(r :s),X) > -a log(n) 
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Figure 4: Dynamic programming estimator of the set of points of independence U* = 
{5.5,10.5}, with 1^41 = 2 and for different penalizing functions. 

eventually almost surely as n —>■ cx). Then if we take a < p(Xr:s) — p(X^.(„_i)) — 
p(X„_,_i.g) we will have that eventually, as the interval length s — r decreases, h{r : 
s, X) = u — r, or equivalently u G H(r : s,X) C H{1 : iV, X). As in any iteration of 
the algorithm there is an interval that contains u and the length decreases, this will 
happen eventually almost surely as n —)■ oo. □ 


5 Simulations 

We perform some simulations and test both the dynamic programming (exact estima¬ 
tor) as well as the hierarchical algorithm. First we considered the alphabet A = {1, 2} 
and the transition matrices Pi and P 2 given by 


Pi = 


6 6 


'5 1 

A = ( f I 

^6 6 
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Figure 5: Hierarchical algorithm estimator of the set of points of independence U* = 
{5.5,10.5}, with |H| = 2 and for different penalizing functions. 

We simulated n realizations of the vector X = (Xi, X 2 ,..., X 15 ) such that Xi,..., X 5 
follow a markov chain with transition matrix Pi, Xg,... ,Xio follow a markov chain 
with transition matrix P 2 and Xu,... ,Xi 5 follow again a markov chain with transi¬ 
tion matrix Pi. The initial condition for any one of these markov chains is the uniform 
distribution over A. The results for different values of n are given in Figure H] for the 
exact estimator and in Figure O for the hierarchical algorithm. 

We performed another simulation study for the alphabet A = {1,2,3} and the 
transition matrices P} and P 2 given by 



We simulated as before n realizations of the vector X = (Xi, X 2 ,..., X 15 ) such 
that Xi,..., X 5 follow a markov chain with transition matrix P{, Xg,..., Xio follow a 
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Figure 6 : Dynamic programming estimator of the set of points of independence U* = 
{5.5,10.5}, with 1^41 = 3 for different penalizing functions. 

markov chain with transition matrix P 2 and Xu,..., X 15 follow again a markov chain 
with transition matrix P[. The initial condition for any one of these markov chains 
is the uniform distribution over A. The results for different values of n are given in 
Figure E] for the exact estimator and in Figure [7] for the hierarchical algorithm. 

The two algorithms were implemented in the open-source package R and are avail¬ 
able upon request. 

6 Dimensionality reduction in Ebola Virus protein alignments 

As an illustration of the practical use of the method proposed in this paper we apply the 
dynamic programming algorithm described in Section Elto a protein sequence alignment 
of Ebola Virus taken from the Ebola HFV Database in- 

The alignment corresponds to the matrix protein VP40, with n = 21 and m = 
326. Although the possible amino-acids in each position are 20, the maximal observed 
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Figure 7: Hierarchical algorithm estimator of the set of points of independence U* = 
{5.5,10.5}, with |H| = 3 for different penalizing functions. 


alphabet is with 5 amino-acids. Moreover, although the number of columns is 
much bigger than the number of observations, a total of 206 columns in the alignment 
shows a conserved amino-acid, with no influence in the segmentation result. 

As the number of observed amino-acids differs considerably in each position, we 
use the empirical penalizing function given by p(X/) = c(max(2, ||Xj||o) — 1), 
where ||Xj||o = J2ai&Ai > 0}- fhis case we obtained different segmentation 

patterns corresponding to different penalizing functions, that can be compared to the 
secondary structure patterns in |6] (see Figure [8]). 
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segments estimated by the dynamic programming algorithm for different values of c. 
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